R for Lunch

Data wrangling with {dplyr}

John Little

Duke University Libraries

Center for Data & Visualization Sciences

2024-01-17

Today’s topics

  • Five essential {dplyr} data wrangling verbs

  • Data pipes inside code-chunks
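As a preview of both topics, here is a minimal sketch that chains five verbs with the pipe. The slides do not list the five verbs, so this assumes they are filter(), select(), mutate(), arrange(), and summarise(), with group_by() as a helper. The built-in mtcars data and the names cars_summary and wt_kg are chosen only for illustration.

```r
library(dplyr)

# Assumption: the five verbs are taken to be filter(), select(), mutate(),
# arrange(), and summarise(); group_by() is used as a helper.
cars_summary <- mtcars |>
  filter(cyl %in% c(4, 6)) |>            # keep rows for 4- and 6-cylinder cars
  select(mpg, cyl, wt) |>                # keep only three columns
  mutate(wt_kg = wt * 1000 * 0.4536) |>  # wt is in 1000 lbs; convert to kg
  arrange(desc(mpg)) |>                  # sort rows by mpg, highest first
  group_by(cyl) |>                       # one group per cylinder count
  summarise(n = n(), mean_mpg = mean(mpg))

cars_summary
```

Each line takes the data frame produced by the line before it, so the chunk reads top to bottom as a sequence of steps.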

Yesterday (video)

  • Import data

  • Tour of RStudio IDE

  • Coding notebooks (Quarto)

Housekeeping

  • Drew / Lauren / breakout rooms
  • CDVS
    • Themes
      • Data Management (Plans, Reproducibility, Repositories)

      • Data Science

      • Data Visualization

      • GIS and Spatial Analysis

      • Data Sources

Housekeeping continued

R for Lunch as a series

R for Lunch is a series that meets 8 times (through the end of February). After today it will meet regularly on Thursdays at noon.

  • Sign up for each workshop individually

  • Each episode has a unique Zoom link

Eat your own dog food


Model how R can work for practical reproducible workflows

Definitions

Opinionated

Tidyverse and Quarto together are the most practical and well-developed reproducible workflow available for scientific analysis and publishing.

Tidyverse and tidy data are an important foundation.

  • Every row is a single observation
  • Every column is a variable
  • The cells are single data values

Tidy data


Wide data

Code
library(tidyverse)
library(gt)
library(gtExtras)

tidyr::relig_income |> 
  gt::gt_preview() |> 
  gtExtras::gt_theme_dark()
       religion            <$10k  $10-20k  $20-30k  $30-40k  $40-50k  $50-75k  $75-100k  $100-150k  >150k  Don't know/refused
1      Agnostic               27       34       60       81       76      137       122        109     84                  96
2      Atheist                12       27       37       52       35       70        73         59     74                  76
3      Buddhist               27       21       30       34       33       58        62         39     53                  54
4      Catholic              418      617      732      670      638     1116       949        792    633                1489
5      Don’t know/refused     15       14       15       11       10       35        21         17     18                 116
6..17
18     Unaffiliated          217      299      374      365      341      528       407        321    258                 597

Tall data

Code
relig_income |> 
  pivot_longer(cols = -religion, 
               names_to = "income_category", 
               values_to = "income") |> 
  gt::gt_preview() |> 
  gtExtras::gt_theme_dark()
        religion      income_category     income
1       Agnostic      <$10k                   27
2       Agnostic      $10-20k                 34
3       Agnostic      $20-30k                 60
4       Agnostic      $30-40k                 81
5       Agnostic      $40-50k                 76
6..179
180     Unaffiliated  Don't know/refused     597
Code
relig_income |> 
  pivot_longer(cols = -religion, 
               names_to = "income_category", 
               values_to = "income") |> 
  mutate(religion = fct_relevel(religion, "Evangelical Prot", "Mainline Prot", "Catholic", "Unaffiliated", "Historically Black Prot")) |> 
  mutate(income_category = fct_rev(as_factor(income_category))) |>
  ggplot(aes(income, income_category)) +
  geom_col(fill = "#eee8d5") +
  facet_wrap(vars(
    fct_other(
      religion, 
      keep = c("Evangelical Prot", "Mainline Prot", "Catholic", "Unaffiliated", "Historically Black Prot")))) +
  theme(plot.background = element_rect(fill = "#002b36"),
        text = element_text(color = "#eee8d5"),
        axis.text = element_text(color = "#eee8d5"), 
        panel.background = element_rect(fill = "#002b36"),
        panel.grid = element_line(color = "#002b36"),
        strip.background = element_rect(fill = "#7b9c9f"))

Code
relig_income |> 
  pivot_longer(cols = -religion, names_to = "income_category") |> 
  ggplot(aes(value, income_category)) +
  geom_col() +
  facet_wrap(vars(religion))

Image Credit: apreshill | CC BY 4.0 | https://github.com/apreshill/teachthat/blob/master/pivot/pivot_longer_smaller.gif
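The wide and tall tables hold the same information in different shapes. As a small sketch of that point, the chunk below pivots relig_income to tall form and then back to wide form with pivot_wider(). The object names tall and wide_again are illustrative and not part of the original slides.

```r
library(tidyr)

# Wide to tall: one row per religion and income category
tall <- relig_income |>
  pivot_longer(cols = -religion,
               names_to = "income_category",
               values_to = "income")

# Tall back to wide: one column per income category again
wide_again <- tall |>
  pivot_wider(names_from = income_category,
              values_from = income)

dim(relig_income)
dim(tall)
dim(wide_again)
```

The tall form is the one that {ggplot2} and the grouped {dplyr} verbs expect, which is why the plots above start with pivot_longer().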

Polls

We are here to help

  • askData@duke.edu

  • https://library.duke.edu/data

  • https://is.gd/littleconsult

Let’s do it

Two things for today

Exercises

  1. https://intro2r.library.duke.edu/ > Exercises > Link out > green Code button > Download ZIP

  2. Then unzip (i.e., expand) the folder on your local file system

  3. Then double-click the rforlunch_exercises.Rproj file

  4. From the Files tab in RStudio, open 01_dplyr.qmd

    • The answer file is in the RStudio rforlunch_exercises project > Files tab > Answers folder

Closing

Pipes and Assignments

Operator     Operator name   Keystroke      Mnemonic
<-           assignment      Alt-dash       “Gets value from”
|> or %>%    pipe            Ctrl-Shift-M   “And then”
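A minimal base R sketch of the two operators, read aloud with their mnemonics. The vector and the names squares, nested, and piped are made up for illustration, and the native pipe |> assumes R 4.1 or later.

```r
# Assignment: squares "gets value from" the expression on the right
squares <- c(1, 4, 9, 16)

# Without the pipe, nested calls are read from the inside out
nested <- round(mean(sqrt(squares)), 1)

# With the pipe, read top to bottom: take squares, "and then" sqrt,
# "and then" mean, "and then" round
piped <- squares |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both versions compute the same value; the piped version lists the steps in the order they happen.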

Citation management

RStudio > Quarto Notebook > Insert > Citation

Example DOI: 10.18637/jss.v059.i10

AI-paired coding

  • Data science concepts: Microsoft Copilot (“More precise” setting)

  • Code completion: GitHub Copilot and RStudio (IDE) or VS Code (IDE)

Bye for now